• DOMAIN: Semiconductor manufacturing process
• CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and process measurement points. However, not all of these signals are equally valuable in a specific monitoring system: the measurements contain a combination of useful information, irrelevant information, and noise, and engineers typically collect far more signals than are actually required. If we consider each type of signal as a feature, feature selection can be applied to identify the most relevant signals. Process engineers can then use these signals to determine the key factors contributing to yield excursions downstream in the process, enabling increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals serve as features to predict the yield type, and by analysing different combinations of features, the essential signals that affect the yield type can be identified.
• DATA DESCRIPTION: sensor-data.csv : (1567, 592). The data consists of 1567 examples, each with 591 features. Each example represents a single production entity with its associated measured features, and the label is a simple pass/fail yield from in-house line testing. In the target column, "-1" corresponds to a pass and "1" corresponds to a fail, and the timestamp marks that specific test point.
• PROJECT OBJECTIVE: We will build a classifier to predict the pass/fail yield of a given process entity and analyse whether all of the features are required to build the model.
#!pip install imblearn
# importing all the required packages
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import time
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency
from sklearn.preprocessing import OneHotEncoder,LabelEncoder,MinMaxScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import classification_report,accuracy_score,recall_score,f1_score,precision_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore")
from imblearn.over_sampling import SMOTE
import pickle
%matplotlib inline
plt.rcParams['figure.figsize'] = (11,8)
#importing dataset
data = pd.read_csv("D:/Datasets/FMT/signal-data.csv")
data.head()
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
5 rows × 592 columns
data.shape
(1567, 592)
# Missing value treatment.
null_df = pd.DataFrame(data.isnull().sum().sort_values(ascending=False))
drop_columns = list(null_df[null_df[0] >= 200].index)
data1 = data.drop(drop_columns,axis = 1)
def change_target_values(x):
    # Remap the target: -1 (pass) -> 0, leave 1 (fail) unchanged
    if x == -1:
        return 0
    return x
data1['Pass/Fail'] = data1['Pass/Fail'].apply(change_target_values)
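The per-element `apply` above can also be expressed as a vectorized `map`, which is the more idiomatic pandas idiom for a fixed value remapping. A minimal sketch on toy data (the series here is illustrative):

```python
import pandas as pd

# Remap the target labels: -1 (pass) -> 0, 1 (fail) -> 1, in one vectorized call.
s = pd.Series([-1, 1, -1, -1, 1])
remapped = s.map({-1: 0, 1: 1})
print(remapped.tolist())  # [0, 1, 0, 0, 1]
```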
data1.head()
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 577 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | 14.9509 | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | 0 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 10.9003 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | 0 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 9.2721 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 8.5831 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | 0 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | 10.9698 | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | 0 |
5 rows × 540 columns
# now we will use simple imputer to fill the missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
scaler = MinMaxScaler()
model = RandomForestClassifier()
X = data1.drop(['Pass/Fail','Time'],axis =1)
y = data1['Pass/Fail']
X_imputed = pd.DataFrame(imputer.fit_transform(X),columns=X.columns)
X_imputed.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 576 | 577 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | 1.6765 | 14.9509 | 0.5005 | 0.0118 | 0.0035 | 2.3630 | 0.0205 | 0.0148 | 0.0046 | 71.9005 |
| 1 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 1.1065 | 10.9003 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 |
| 2 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 2.0952 | 9.2721 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 |
| 3 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 1.7585 | 8.5831 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 |
| 4 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | 1.6597 | 10.9698 | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 |
5 rows × 538 columns
X_imputed.columns
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'576', '577', '582', '583', '584', '585', '586', '587', '588', '589'],
dtype='object', length=538)
print('Value counts for target variable :\n', y.value_counts())
Value counts for target variable :
 0    1463
1     104
Name: Pass/Fail, dtype: int64
# let us look at the correlation between variables; we will drop the highly correlated ones
plt.figure(figsize = (20,15))
corr = X_imputed.corr()
#annot = False
sns.heatmap(corr,square = True,cmap = 'Greens',linecolor= 'black')
plt.show()
print(X_imputed.corr().columns[:])
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'576', '577', '582', '583', '584', '585', '586', '587', '588', '589'],
dtype='object', length=538)
# Remove the highly collinear features from the data
def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model
        to generalize and improves the interpretability of the model.
    Inputs:
        x: features dataframe
        threshold: features with correlations greater than this value are removed
    Output:
        dataframe that contains only the non-highly-collinear features
    '''
    # Calculate the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i + 1):
            item = corr_matrix.iloc[j:(j + 1), (i + 1):(i + 2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                # print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)
    return x
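The nested-loop scan above can be expressed more compactly with an upper-triangle mask over the absolute correlation matrix. A sketch of an equivalent vectorized version (the function name `drop_collinear` and the toy frame are illustrative):

```python
import numpy as np
import pandas as pd

def drop_collinear(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    # Keep only the upper triangle of |corr| so each pair is examined once.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop any column correlated above the threshold with an earlier column.
    to_drop = [c for c in upper.columns if (upper[c] >= threshold).any()]
    return df.drop(columns=to_drop)

# Toy check: 'b' is an exact multiple of 'a', so one of the pair is dropped.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_collinear(toy, 0.70).columns.tolist())  # ['a', 'c']
```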
X_imputed2 = remove_collinear_features(X_imputed,0.70)
# correlation heatmap again, after dropping the highly correlated features
plt.figure(figsize = (20,15))
corr = X_imputed2.corr()
#annot = False
sns.heatmap(corr,square = True,cmap = 'Greens')
plt.show()
print(X_imputed.shape)
print(X_imputed2.shape)
(1567, 538) (1567, 309)
num_cols = X_imputed2.select_dtypes(include = [np.number]).columns
cat_cols = X_imputed2.select_dtypes(exclude = [np.number]).columns
num_cols
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'558', '559', '570', '571', '572', '582', '583', '586', '587', '589'],
dtype='object', length=309)
cat_cols
Index([], dtype='object')
sns.countplot(x=y)
plt.show()
plt.figure(figsize=(30, 25))
for i, c in enumerate(num_cols):
    plt.subplot(31, 10, i + 1)
    sns.boxplot(x=c, data=X_imputed2)
    plt.title('BoxPlot for ' + c)
plt.tight_layout()
plt.show()
# normalize each sample, then drop all columns whose variance is below the threshold
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_norm = pd.DataFrame(normalizer.fit_transform(X_imputed2), columns=X_imputed2.columns)
X_norm.shape
(1567, 309)
selector = VarianceThreshold()
selector.fit(X_norm)
mask = selector.get_support()
columns = X_norm.columns
selected_cols = columns[mask]
n_features2 = len(selected_cols)
print(f'remaining features: {n_features2}')
remaining features: 197
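For intuition, `VarianceThreshold` with its default threshold of 0.0 simply removes features that are constant on the training data. A minimal sketch on a toy frame (the column names are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# 'const' never varies, so its variance is 0 and the default threshold drops it.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'const': [5.0, 5.0, 5.0]})
vt = VarianceThreshold()  # threshold=0.0 keeps only features with variance > 0
kept = toy.columns[vt.fit(toy).get_support()]
print(kept.tolist())  # ['a']
```

Note that the selector above is fitted on the normalized `X_norm`, so the variance cutoff is applied on the normalized scale.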
X_var = pd.DataFrame(selector.transform(X_imputed2), columns = selected_cols)
X_var.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 558 | 559 | 570 | 571 | 572 | 582 | 583 | 586 | 587 | 589 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | 1.0344 | 0.4385 | 533.8500 | 2.1113 | 8.95 | 0.5005 | 0.0118 | 0.0205 | 0.0148 | 71.9005 |
| 1 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 0.9634 | 0.1745 | 535.0164 | 2.4335 | 5.92 | 0.5019 | 0.0223 | 0.0096 | 0.0201 | 208.2045 |
| 2 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 1.5021 | 0.3718 | 535.0245 | 2.0293 | 11.21 | 0.4958 | 0.0157 | 0.0584 | 0.0484 | 82.8602 |
| 3 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 1.1613 | 0.7288 | 530.5682 | 2.0253 | 9.33 | 0.4990 | 0.0103 | 0.0202 | 0.0149 | 73.8432 |
| 4 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | 0.9778 | 0.2156 | 532.0155 | 2.0275 | 8.83 | 0.4800 | 0.4766 | 0.0202 | 0.0149 | 73.8432 |
5 rows × 197 columns
# now keep only the features with a meaningful correlation to the target, and drop the rest
X_var_corr = X_var.copy()
X_var_corr['Target'] = y
var_corr = X_var_corr.corr()
no_corr_columns = var_corr.columns[var_corr['Target'].abs() < 0.05]
X_var_corr = X_var_corr.drop(columns=no_corr_columns).drop(columns=['Target'])
no_corr_columns
Index(['0', '1', '2', '3', '4', '6', '7', '8', '9', '10',
...
'558', '559', '570', '571', '572', '582', '583', '586', '587', '589'],
dtype='object', length=166)
X_var_corr
| 5 | 21 | 32 | 33 | 38 | 42 | 49 | 56 | 58 | 59 | ... | 133 | 159 | 160 | 166 | 183 | 200 | 210 | 460 | 510 | 511 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | -5419.00 | 83.3971 | 9.5126 | 86.9555 | 70.0 | 1.0 | 0.9317 | 4.7057 | -1.7264 | ... | 1000.7263 | 1017.0 | 967.0 | 2.0 | 16.713 | 10.30 | 0.0772 | 29.9394 | 64.6707 | 0.0000 |
| 1 | 100.0 | -5441.50 | 84.9052 | 9.7997 | 87.5241 | 70.0 | 1.0 | 0.9324 | 4.6820 | 0.8073 | ... | 998.1081 | 568.0 | 59.0 | 2.2 | 16.358 | 8.02 | 0.0566 | 40.4475 | 141.4365 | 0.0000 |
| 2 | 100.0 | -5447.75 | 84.7569 | 8.6590 | 84.7327 | 70.0 | 1.0 | 0.9139 | 4.5873 | 23.8245 | ... | 998.4440 | 562.0 | 788.0 | 2.1 | 22.912 | 16.73 | 0.0339 | 32.3594 | 240.7767 | 244.2748 |
| 3 | 100.0 | -5468.25 | 84.9105 | 8.6789 | 86.6867 | 70.0 | 1.0 | 0.9139 | 4.5873 | 24.3791 | ... | 980.4510 | 859.0 | 355.0 | 1.7 | 22.562 | 13.56 | 0.1248 | 27.6824 | 113.5593 | 0.0000 |
| 4 | 100.0 | -5476.25 | 86.3269 | 8.7677 | 86.1468 | 70.0 | 1.0 | 0.9298 | 4.6414 | -12.2945 | ... | 993.1274 | 699.0 | 283.0 | 3.9 | 37.715 | 19.77 | 0.0915 | 30.8924 | 148.0663 | 0.0000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1562 | 100.0 | -5418.75 | 83.8405 | 8.7164 | 86.3672 | 70.0 | 1.0 | 0.9204 | 4.4941 | 2.8182 | ... | 997.7594 | 1280.0 | 334.0 | 5.9 | 14.104 | 12.71 | 0.0797 | 52.6790 | 53.1915 | 235.7895 |
| 1563 | 100.0 | -6408.75 | 84.0623 | 8.9607 | 86.4051 | 70.0 | 1.0 | 0.9255 | 4.5305 | -3.3555 | ... | 1015.7622 | 504.0 | 94.0 | 2.7 | 30.347 | 24.47 | 0.0797 | 18.5401 | 29.4372 | 700.0000 |
| 1564 | 100.0 | -5153.25 | 85.8638 | 8.1728 | 86.3506 | 70.0 | 1.0 | 0.9353 | 4.6118 | 1.1664 | ... | 1004.0500 | 1178.0 | 542.0 | 3.2 | 20.963 | 19.37 | 0.0797 | 37.7546 | 54.8330 | 0.0000 |
| 1565 | 100.0 | -5271.75 | 84.5602 | 9.1930 | 86.3130 | 70.0 | 1.0 | 0.9207 | 4.5509 | 4.4682 | ... | 999.4826 | 1740.0 | 252.0 | 2.2 | 13.879 | 9.49 | 0.0797 | 29.2827 | 78.4993 | 456.4103 |
| 1566 | 100.0 | -5319.50 | 83.3424 | 8.7786 | 86.4039 | 70.0 | 1.0 | 0.9187 | 4.5479 | 1.8718 | ... | 1004.0500 | 763.0 | 304.0 | 3.9 | 18.859 | 16.09 | 0.0797 | 17.0933 | 75.8621 | 317.6471 |
1567 rows × 31 columns
# using a Pipeline to chain imputation, scaling, and the model
from sklearn.pipeline import Pipeline
p_line = Pipeline([('imputer',SimpleImputer(missing_values=np.nan, strategy='median')),
('scaler',MinMaxScaler()),('model',RandomForestClassifier())])
p_line.fit(X, y)
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', MinMaxScaler()),
('model', RandomForestClassifier())])
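One practical payoff of wrapping the steps in a `Pipeline` is leakage-free cross-validation: each fold re-fits the imputer and scaler on its own training split only. A sketch on synthetic data (the toy data and scores are illustrative, not results from the SECOM set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data with a few missing values to exercise the imputer step.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=1)
X_toy[::20, 0] = np.nan

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', MinMaxScaler()),
                 ('model', RandomForestClassifier(random_state=1))])

# Each CV fold fits the imputer/scaler on its own training split, so no
# statistics leak from the held-out fold into preprocessing.
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.mean())
```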
model = RandomForestClassifier()
X_scaled = scaler.fit_transform(X_var_corr)
model.fit(X_scaled,y)
imp_features_df = pd.DataFrame(model.feature_importances_,index = X_var_corr.columns,columns = ['Values'])
imp_features_df = imp_features_df.sort_values(by='Values',ascending=False)
imp_features_df.sort_values(by = 'Values',ascending= False)
| Values | |
|---|---|
| 59 | 0.066875 |
| 64 | 0.060815 |
| 166 | 0.050538 |
| 21 | 0.050471 |
| 210 | 0.047304 |
| 63 | 0.044691 |
| 38 | 0.044393 |
| 103 | 0.043619 |
| 510 | 0.042113 |
| 200 | 0.040102 |
| 121 | 0.038593 |
| 460 | 0.037035 |
| 159 | 0.036878 |
| 100 | 0.034926 |
| 183 | 0.033624 |
| 79 | 0.033521 |
| 33 | 0.033392 |
| 160 | 0.033223 |
| 126 | 0.032740 |
| 58 | 0.030823 |
| 511 | 0.029831 |
| 133 | 0.029460 |
| 129 | 0.028944 |
| 32 | 0.027704 |
| 56 | 0.023187 |
| 95 | 0.016018 |
| 114 | 0.009179 |
| 69 | 0.000000 |
| 49 | 0.000000 |
| 42 | 0.000000 |
| 5 | 0.000000 |
model = RandomForestClassifier()
X_scaled_all = scaler.fit_transform(X_imputed)
model.fit(X_scaled_all,y)
imp_features_df_all = pd.DataFrame(model.feature_importances_,index = X.columns,columns = ['Values'])
imp_features_df_all = imp_features_df_all.sort_values(by='Values',ascending=False)
imp_features_df_all.sort_values(by = 'Values',ascending= False)
| Values | |
|---|---|
| 59 | 0.012984 |
| 64 | 0.012477 |
| 153 | 0.009375 |
| 65 | 0.008800 |
| 348 | 0.007168 |
| ... | ... |
| 512 | 0.000000 |
| 513 | 0.000000 |
| 514 | 0.000000 |
| 515 | 0.000000 |
| 149 | 0.000000 |
538 rows × 1 columns
(imp_features_df_all['Values'] <= 0.000).sum()
122
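`SelectKBest` with `chi2` (imported at the top but not used so far) is another route to ranking features; since `chi2` requires non-negative inputs, it pairs naturally with the `MinMaxScaler` output used above. A sketch on synthetic data (the toy data and `k=5` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data; scale to [0, 1] because chi2 scores require non-negative features.
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=0)
X_pos = MinMaxScaler().fit_transform(X_toy)

# Keep the 5 features with the highest chi-squared statistic vs. the target.
selector = SelectKBest(score_func=chi2, k=5)
X_top5 = selector.fit_transform(X_pos, y_toy)
print(X_top5.shape)  # (300, 5)
```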
X_train,X_test,y_train,y_test = train_test_split(X_scaled_all,y,test_size = 0.2,random_state = 1)
from sklearn.linear_model import RidgeClassifier
for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]:
    ridge = RidgeClassifier(alpha=alpha)
    ridge.fit(X_train, y_train)
    train_score = ridge.score(X_train, y_train)
    test_score = ridge.score(X_test, y_test)
    print(f'{alpha} : {train_score}')
    print(f'{alpha} : {test_score}')
    print('*' * 30)
0.1 : 0.945730247406225
0.1 : 0.9267515923566879
******************************
0.2 : 0.9441340782122905
0.2 : 0.9299363057324841
******************************
0.3 : 0.9441340782122905
0.3 : 0.9331210191082803
******************************
0.4 : 0.9433359936153233
0.4 : 0.9331210191082803
******************************
0.5 : 0.942537909018356
0.5 : 0.9331210191082803
******************************
1.0 : 0.9409417398244214
1.0 : 0.9363057324840764
******************************
ridge = RidgeClassifier(alpha= 1.0)
ridge.fit(X_train,y_train)
# Ridge does not zero out coefficients exactly; count those that round to zero
(np.round(ridge.coef_, 2) == 0.0).sum()
128
# with only the 31 remaining columns
X_train,X_test,y_train,y_test = train_test_split(X_var_corr,y,test_size = 0.2,random_state = 1)
for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]:
    ridge = RidgeClassifier(alpha=alpha)
    ridge.fit(X_train, y_train)
    train_score = ridge.score(X_train, y_train)
    test_score = ridge.score(X_test, y_test)
    print(f'{alpha} : {train_score}')
    print(f'{alpha} : {test_score}')
    print('*' * 30)
0.1 : 0.9329608938547486
0.1 : 0.9331210191082803
******************************
0.2 : 0.9329608938547486
0.2 : 0.9331210191082803
******************************
0.3 : 0.9329608938547486
0.3 : 0.9331210191082803
******************************
0.4 : 0.9329608938547486
0.4 : 0.9331210191082803
******************************
0.5 : 0.9329608938547486
0.5 : 0.9331210191082803
******************************
1.0 : 0.9337589784517158
1.0 : 0.9331210191082803
******************************
ridge = RidgeClassifier(alpha=1.0)
ridge.fit(X_train, y_train)
print(np.round(ridge.coef_[0], 3))
coeff_dict = {}
for col, coeff in zip(X_var_corr.columns, ridge.coef_[0]):
    coeff_dict[col] = coeff
[ 0. 0. 0.001 0.009 -0.061 0. 0. 0.026 0.136 0.011 -0.007 0.015 0. -0.453 -0. 0.01 0.037 0.087 0.472 0.092 0.029 -0.001 -0. 0. 0.035 0.002 0.004 0.412 0.002 0.001 0. ]
ridge_drop_cols = []
for key in coeff_dict.keys():
    if coeff_dict[key] == 0.:
        ridge_drop_cols.append(key)
ridge_drop_cols.append('114')
X_var_corr = X_var_corr.drop(columns = ridge_drop_cols)
model = RandomForestClassifier()
X_scaled = scaler.fit_transform(X_var_corr)
model.fit(X_scaled,y)
imp_features_df = pd.DataFrame(model.feature_importances_,index = X_var_corr.columns,columns = ['Values'])
imp_features_df = imp_features_df.sort_values(by='Values',ascending=False)
imp_features_df.sort_values(by = 'Values',ascending= False)
| Values | |
|---|---|
| 64 | 0.067541 |
| 59 | 0.065978 |
| 21 | 0.050146 |
| 166 | 0.049623 |
| 103 | 0.046548 |
| 38 | 0.046491 |
| 121 | 0.044358 |
| 63 | 0.042473 |
| 210 | 0.042149 |
| 160 | 0.040386 |
| 460 | 0.040195 |
| 510 | 0.039279 |
| 200 | 0.037214 |
| 33 | 0.037128 |
| 159 | 0.036449 |
| 183 | 0.034095 |
| 100 | 0.033210 |
| 129 | 0.032316 |
| 79 | 0.030776 |
| 58 | 0.030607 |
| 126 | 0.030214 |
| 32 | 0.029205 |
| 133 | 0.028789 |
| 511 | 0.026122 |
| 56 | 0.025597 |
| 95 | 0.013111 |
plt.figure(figsize=(30, 25))
for i, c in enumerate(X_var_corr.columns):
    plt.subplot(5, 6, i + 1)
    sns.boxplot(x=c, data=X_imputed2)
    plt.title('BoxPlot for ' + c)
plt.tight_layout()
plt.show()
plt.figure(figsize=(30, 25))
for i, c in enumerate(X_var_corr.columns):
    plt.subplot(5, 6, i + 1)
    sns.kdeplot(x=c, data=X_imputed2)
    plt.title('KDE Plot for ' + c)
plt.tight_layout()
plt.show()
sns.pairplot(X_var_corr)
plt.show()
X_var_corr_target = X_var_corr.copy()
X_var_corr_target['Target'] = y
plt.figure(figsize=(30, 25))
for i, c in enumerate(X_var_corr.columns):
    plt.subplot(5, 6, i + 1)
    sns.boxplot(y=c, x='Target', data=X_var_corr_target)
    plt.title('BoxPlot for ' + c)
plt.tight_layout()
plt.show()
plt.figure(figsize=(30, 25))
for i, c in enumerate(X_var_corr.columns):
    plt.subplot(5, 6, i + 1)
    sns.kdeplot(x=c, hue='Target', data=X_var_corr_target)
    plt.title('KDE Plot for ' + c)
plt.tight_layout()
plt.show()
sns.pairplot(data = X_var_corr_target,hue = 'Target')
plt.show()